Universal image segmentation is not a new concept. Past attempts to unify image segmentation over the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation, because they need to be trained individually on semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on the ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic and able to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20K, Cityscapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even further performance improvements. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer.
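A heavily simplified sketch of the task-token idea: a shared set of object queries is conditioned on a task prompt so that one set of weights can serve all three segmentation tasks. The hash-based toy embedding, additive fusion, and function names below are hypothetical; the actual model uses a learned text mapper and transformer decoder queries.

```python
# Toy task conditioning: same weights, different task token.

def task_token(task, dim=4):
    """Deterministic toy embedding of the prompt 'the task is {task}'."""
    prompt = f"the task is {task}"
    return [(hash(prompt) >> (8 * i)) % 97 / 97.0 for i in range(dim)]

def condition_queries(queries, task):
    """Fuse the task embedding into every object query (simplified)."""
    t = task_token(task, dim=len(queries[0]))
    return [[q_i + t_i for q_i, t_i in zip(q, t)] for q in queries]

queries = [[0.0] * 4 for _ in range(3)]          # three toy object queries
semantic_q = condition_queries(queries, "semantic")
instance_q = condition_queries(queries, "instance")
# Identical weights, different task token -> task-dynamic behaviour.
```

Changing only the task string changes the conditioned queries, which is the mechanism that lets a single trained model switch between semantic, instance, and panoptic inference.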
Image generation has been a long sought-after but challenging task, and performing the generation task efficiently is similarly difficult. Researchers often attempt to create a "one size fits all" generator, where there are few differences in the parameter space for drastically different datasets. Herein, we present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility. At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information, which is achieved through the use of Neighborhood Attention (NA). With different heads able to attend to varying receptive fields, the model is better able to combine this information and adapt, in a highly flexible manner, to the data at hand. StyleNAT attains a new SOTA FID score on FFHQ-256 of 2.046, beating prior work based on convolutional models such as StyleGAN-XL and on transformers such as HiT and StyleSwin, and a new transformer SOTA on FFHQ-1024 with an FID score of 4.174. These results show a 6.4% improvement on FFHQ-256 scores compared to StyleGAN-XL, with a 28% reduction in the number of parameters and a 56% improvement in sampling throughput. Code and models will be open-sourced at https://github.com/SHI-Labs/StyleNAT.
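The head-partitioning idea can be sketched in 1-D (a hypothetical simplification; StyleNAT partitions 2-D attention heads over Neighborhood Attention kernels): some head groups see a small neighborhood for fine detail while others see a larger one for global structure.

```python
# Toy per-head-group receptive fields for one query position.

def neighborhood(seq_len, i, k):
    """Indices of the k nearest neighbors of position i, clamped to bounds."""
    start = max(0, min(i - k // 2, seq_len - k))
    return list(range(start, start + k))

def head_neighborhoods(seq_len, i, kernels=(3, 7)):
    """Neighbor sets per head group, one kernel size per group."""
    return {k: neighborhood(seq_len, i, k) for k in kernels}

hoods = head_neighborhoods(seq_len=10, i=5)
# Small-kernel heads mix local detail; large-kernel heads see wider context.
```

Concatenating outputs from heads with different kernel sizes is what lets the model combine local and global information in one layer.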
Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's shifted window self-attention. While effective at reducing self-attention's quadratic complexity, local attention weakens two of the most desirable properties of self-attention: long-range inter-dependency modeling and the global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce the Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer. DiNAT variants enjoy significant improvements over attention-based baselines such as NAT and Swin, as well as the modern convolutional baseline ConvNeXt. Our large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, with faster throughput. We believe that combinations of NA and DiNA have the potential to empower a variety of tasks beyond those presented in this paper. To support and encourage research in this direction, in vision and beyond, we open-source our project at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
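The receptive-field effect of dilation can be shown with a 1-D index toy (DiNA itself operates on 2-D feature maps; this sketch only illustrates the index pattern): with dilation d, the same k neighbors span (k-1)*d+1 positions at the same cost.

```python
# Toy dilated neighbor indices: NA is the special case d = 1.

def dilated_neighborhood(seq_len, i, k, d):
    """k neighbor indices of position i spaced d apart, clamped to bounds."""
    span = (k - 1) * d
    start = max(0, min(i - (k // 2) * d, seq_len - 1 - span))
    return [start + j * d for j in range(k)]

local  = dilated_neighborhood(seq_len=32, i=16, k=5, d=1)  # NA: span 5
dilate = dilated_neighborhood(seq_len=32, i=16, k=5, d=4)  # DiNA: span 17
```

Stacking layers that alternate between d = 1 and growing dilations is how the hierarchy mixes local detail with sparse global context.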
Recent research has revealed that reducing both temporal and spatial redundancy is an effective approach to efficient video recognition, e.g., allocating the majority of computation to task-relevant frames or to the most valuable image regions of each frame. However, in most existing works, either type of redundancy is typically modeled with the other absent. This paper explores a unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm, contributing an improved AdaFocusV3 framework. Our method reduces computational cost by activating the expensive high-capacity network only on a few small but informative 3D video cubes. These cubes are cropped from the space formed by frame height, width, and video duration, while their locations are adaptively determined by a light-weight policy network on a per-sample basis. At test time, the number of cubes corresponding to each video is configured dynamically, i.e., video cubes are processed sequentially until a sufficiently reliable prediction is produced. Notably, AdaFocusV3 can be trained effectively by approximating the non-differentiable cropping operation with the interpolation of deep features. Extensive empirical results on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Diving48) demonstrate that our model is considerably more efficient than competitive baselines.
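The two mechanisms above can be sketched in a toy form (crop coordinates and the confidence rule are hypothetical stand-ins; the real cube locations come from a learned policy network): crop 3-D cubes from a (time, height, width) video, then process them sequentially until the prediction is reliable enough.

```python
# Toy cube cropping plus early-exit sequential processing.

def crop_cube(video, t0, y0, x0, ct, ch, cw):
    """Crop a ct x ch x cw cube from a nested-list video tensor."""
    return [[row[x0:x0 + cw] for row in frame[y0:y0 + ch]]
            for frame in video[t0:t0 + ct]]

def process_until_confident(cubes, score_fn, threshold):
    """Accumulate per-cube evidence; early exit saves computation."""
    total, used = 0.0, 0
    for cube in cubes:
        total += score_fn(cube)
        used += 1
        if total >= threshold:
            break
    return total, used

video = [[[t * 100 + y * 10 + x for x in range(4)]
          for y in range(4)] for t in range(4)]
cube = crop_cube(video, t0=1, y0=1, x0=1, ct=2, ch=2, cw=2)
```

Easy videos stop after few cubes while hard ones consume more, which is the source of the per-sample dynamic cost.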
This paper demonstrates a novel approach to improving face-recognition pose-invariance using semantic-segmentation features. The proposed Seg-Distilled-ID network jointly learns identification and semantic-segmentation tasks, with the segmentation task then "distilled" into a MobileNet encoder. Performance is benchmarked against three state-of-the-art encoders on a publicly available dataset emphasizing head-pose variation. Experimental evaluations show that the Seg-Distilled-ID network exhibits notable robustness benefits, reaching 99.9% test accuracy compared with 96.1% for VGG-19 and 96.3% for InceptionV3. This is achieved using approximately one-tenth of the top encoder's inference parameters. These results demonstrate that distilled semantic-segmentation features can efficiently address face-recognition pose-invariance.
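A toy feature-distillation objective in the spirit of this approach (the loss weighting, function names, and the MSE choice are assumptions, not the paper's exact formulation): the identity network is trained on its own task while matching the segmentation teacher's features.

```python
# Toy distillation loss: ID task loss blended with feature matching.

def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_feat, teacher_feat, id_loss, alpha=0.5):
    """Blend the recognition loss with a feature-matching term."""
    return (1 - alpha) * id_loss + alpha * mse(student_feat, teacher_feat)

# Perfectly matched features leave only the (down-weighted) ID loss.
loss = distill_loss([0.2, 0.4], [0.2, 0.4], id_loss=0.8)
```

The feature-matching term is what transfers pose-robust segmentation cues into the compact recognition encoder.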
Despite the popularity of model compression and multitask learning, how to effectively compress a multitask model has been less thoroughly analyzed, due to the challenging entanglement of tasks in the parameter space. In this paper, we propose DiSparse, a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme. We consider each task independently by disentangling the importance measurement when performing parameter pruning and selection. Our experimental results demonstrate superior performance across various configurations and settings compared to popular sparse training and pruning methods. Besides its effectiveness in compression, DiSparse also provides a powerful tool for the multitask learning community. Surprisingly, despite the high model sparsity DiSparse enforces, we even observed better performance than some dedicated multitask learning methods in several cases. We analyzed the pruning masks generated by DiSparse and observed strikingly similar sparse network architectures identified by each task, even before training starts. We also observed the existence of a "watershed" layer at which task relatedness drops sharply, implying that there is no benefit in continued parameter sharing. Our code and models will be made available at: https://github.com/SHI-Labs/DiSparse-Multitask-Model-Compression.
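The disentangled-importance idea can be sketched as follows (the scoring and the union rule are simplified assumptions, not the paper's exact criterion): rank parameters per task independently, then keep any parameter that some task considers important.

```python
# Toy disentangled pruning: per-task top-k masks, then their union.

def topk_mask(scores, k):
    """Boolean keep-mask over the k largest scores."""
    keep = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    return [i in keep for i in range(len(scores))]

def disentangled_mask(per_task_scores, k):
    """Union of per-task top-k masks."""
    masks = [topk_mask(s, k) for s in per_task_scores]
    return [any(col) for col in zip(*masks)]

# Two tasks, four parameters: each task protects its own top parameter.
mask = disentangled_mask([[0.9, 0.1, 0.2, 0.8],
                          [0.1, 0.9, 0.2, 0.05]], k=1)
```

Scoring tasks separately is what prevents one dominant task's importance signal from pruning away parameters another task still needs.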
We present the Neighborhood Attention Transformer (NAT), an efficient, accurate, and scalable hierarchical transformer that works well on both image classification and downstream vision tasks. It is built upon Neighborhood Attention (NA), a simple and flexible attention mechanism that localizes each query's receptive field to its nearest neighboring pixels. NA is a localization of self-attention, and approaches it as the receptive field size grows. It is also equivalent to Swin Transformer's shifted window attention in FLOPs and memory usage at the same receptive field size, while being less constrained. Furthermore, NA comes with a local inductive bias, which eliminates the need for extra operations such as pixel shifts. Experimental results on NAT are competitive; NAT-Tiny, with only 4.3 GFLOPs and 28M parameters on ImageNet classification, reaches 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K. We open-source our checkpoints, code, and CUDA kernels at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
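A scalar 1-D toy of Neighborhood Attention (the real NA runs on 2-D pixel grids with learned query/key/value projections; this only shows the localized softmax): each query attends to its k nearest neighbors, with the window clamped at the borders so every query always sees exactly k positions.

```python
# Toy 1-D neighborhood attention over scalar features.
import math

def neighborhood_attention(x, k):
    """x: list of scalar features; returns locally attended outputs."""
    n, out = len(x), []
    for i in range(n):
        start = max(0, min(i - k // 2, n - k))
        idx = list(range(start, start + k))         # nearest neighbors
        w = [math.exp(x[i] * x[j]) for j in idx]    # toy q.k scores
        z = sum(w)
        out.append(sum(wi / z * x[j] for wi, j in zip(w, idx)))
    return out

out = neighborhood_attention([1.0, 1.0, 1.0, 1.0], k=3)
```

Because the window clamps rather than shifts, NA reduces to full self-attention when k reaches the sequence length, which is the sense in which it "approaches" self-attention.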
With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that, with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as few as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when trained on CIFAR-10 with only 3.7M parameters, a significant improvement in data-efficiency over previous transformer-based models: it is over 10x smaller than other transformers and 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some recent NAS-based methods. Additionally, we obtain a new SOTA on Flowers-102, with 99.76% top-1 accuracy, and improve upon existing baselines on ImageNet (82.71% accuracy with 29% of the parameters of ViT), as well as on NLP tasks. Our simple and compact design makes transformers more feasible to study for those with limited computing resources and/or who deal with small datasets, while extending existing research efforts in data-efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
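A 1-D toy of convolutional tokenization (fixed weights for illustration; CCT learns 2-D convolutions): a small convolution plus pooling replaces ViT's non-overlapping patch embedding, baking a local inductive bias into the token sequence.

```python
# Toy convolutional tokenizer: overlapping conv, then downsampling pool.

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation), no padding."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool(x, size):
    """Non-overlapping max pooling."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def conv_tokenize(pixels, kernel=(0.25, 0.5, 0.25), pool=2):
    """Pixels -> overlapping conv features -> downsampled token sequence."""
    return max_pool(conv1d(pixels, list(kernel)), pool)

tokens = conv_tokenize([0, 1, 2, 3, 4, 5, 6, 7])
```

Unlike hard patch boundaries, the overlapping convolution lets neighboring tokens share information, which is part of what helps the model avoid overfitting on small datasets.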
Federated learning is a distributed framework according to which a model is trained over a set of devices while keeping data localized. This framework faces several systems-oriented challenges, including (i) a communication bottleneck, since a large number of devices upload their local updates to a parameter server, and (ii) scalability, as the federated network consists of millions of devices. Due to these systems challenges, as well as issues related to the statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance, yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging, where models are updated locally at the devices and only periodically averaged at the server; (2) partial device participation, where only a fraction of the devices participate in each round of training; and (3) quantized message-passing, where the edge nodes quantize their updates before uploading them to the parameter server. These features address the communication and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions, and empirically demonstrate the communication-computation tradeoff provided by our method.
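The three features combine as sketched below on a toy scalar model (the uniform quantizer, learning rate, and gradient oracle are hypothetical stand-ins for the paper's general scheme): a sampled subset of devices each takes several local SGD steps, quantizes its update, and the server averages the quantized deltas.

```python
# Toy FedPAQ-style round on a scalar model.

def quantize(v, step=0.01):
    """Uniform quantization of an update before upload (feature 3)."""
    return round(v / step) * step

def local_update(w, grad_fn, lr, tau):
    """tau local SGD steps on one device (feature 1: periodic averaging)."""
    for _ in range(tau):
        w -= lr * grad_fn(w)
    return w

def fedpaq_round(w, device_grads, lr=0.1, tau=5, step=0.01):
    """One round over the participating subset of devices (feature 2)."""
    deltas = [quantize(local_update(w, g, lr, tau) - w, step)
              for g in device_grads]
    return w + sum(deltas) / len(deltas)

# Two devices minimizing f(w) = w^2 / 2, whose gradient is w.
w_next = fedpaq_round(1.0, [lambda w: w, lambda w: w])
```

Each round uploads one quantized delta per participating device instead of a full-precision update every step, which is where the communication savings come from.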
Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more abstract natural language and more structured programming languages. One potential bridge for this gap is the Graphical User Interface (GUI), as GUIs inherently encode salient information about underlying program functionality into rich, pixel-based data representations. This paper offers one of the first comprehensive empirical investigations into the connection between GUIs and functional, natural language descriptions of software. First, we collect, analyze, and open-source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications. The descriptions were obtained from human labelers and underwent several quality control mechanisms. To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided with a screenshot as input. We evaluate these models quantitatively, using common machine translation metrics, and qualitatively through a large-scale user study. Finally, we offer lessons learned and a discussion of the potential shown by multimodal models to enhance future techniques for automated software documentation.
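The quantitative evaluation mentioned above relies on machine-translation metrics; a toy unigram-precision score in the spirit of BLEU-1 looks like the following (illustrative only; a real evaluation would use a full BLEU/METEOR implementation, and the example strings are hypothetical GUI descriptions).

```python
# Toy clipped unigram precision between a predicted and reference caption.
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate caption vs. a reference."""
    cand, ref = candidate.split(), Counter(reference.split())
    if not cand:
        return 0.0
    hits = 0
    for tok in cand:
        if ref[tok] > 0:       # clip repeated matches, as BLEU does
            hits += 1
            ref[tok] -= 1
    return hits / len(cand)

score = unigram_precision("a login screen with two text fields",
                          "login screen with two input fields")
```

Such n-gram overlap metrics reward lexical matches, which is precisely why the study pairs them with a large-scale qualitative user study.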